
Unraveling the tree of life: A grand challenge for Biology. 


Scott V. Edwards 

Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138 USA 


Abstract 

Building the Tree of Life is an ongoing activity of scientists around the world, one that combines information 
from both the genotype and phenotype of organisms. I review recent trends in this effort and describe a number 
of models, including the multispecies coalescent model, as means to achieve this end. 

1. Introduction 

Building and studying the Tree of Life (ToL) is 
one of the grand challenges of the biological sciences 
(ToL: Wolf et al. 2002, Cracraft et al. 2004, Delsuc et 
al. 2005, Pace 2009, Forterre 2015, Soltis et al. 2018). 
The ToL is a roadmap to understanding biological 
diversity. An understanding of the details of the 
branches of the tree of life allows researchers to 
understand macroevolutionary trends in diverse traits, 
from ecology (Webb et al. 2002, Geeta et al. 2014, Li 
2016), behavior (Eisenstein et al. 2016), geography 
(Jetz et al. 2011) and adaptation to cancer (Aktipis et 
al. 2015), human health (Wolf et al. 2002) and many 
other issues of high societal relevance (McTavish et 
al. 2017). Some researchers have questioned the reality 
of the ToL, preferring to call it a ‘net of life’ or some 
other term that conveys the large amounts of horizontal 
gene transfer, hybridization and other reticulate 
phenomena that are now known to occur across 
diverse taxa (Doolittle 2003, Booth et al. 2016, Doolittle 
et al. 2016). Indeed, reticulation events in the form of 
hybridization, introgression or horizontal gene transfer 
are common across diverse clades, including eukaryotes, 
vertebrates and ultimately humans. However, I and 
many others believe that there is enough consistency 
of phylogenetic signal across diverse data sets to 
suggest that there is a dominant or modal ToL, despite 
rampant discordance due to reticulations (Daubin et 
al. 2003). 

2. Genomes, Phenomes and the Tree of Life 

The genomes of living and extinct organisms 
possess an abundance of data that can be used to 
build and unravel the Tree of Life. There is some 
disagreement among systematists about the 
extent to which molecular data should dominate the 
search for the ToL, an argument that goes back 
several decades (Patterson et al. 1987). Suggestions 
have been made that molecular characters are the 
best choice for building the ToL, outperforming 
morphological traits in the number of characters, ease 
of delimitation of characters, consistency of signal or 
providing less biased sampling of characters (Scotland 
et al. 2003). However, some systematists, particularly 
those with backgrounds in museum science that 
traditionally have employed morphological characters 
in their classifications, favor continued use of 
morphology in building the ToL, especially with regard 
to inclusion of fossil taxa (Wiens 2004, O’Leary et 
al. 2013). Arguments for the use of morphology in 
building the ToL also derive from the philosophical 
mandate promoted by Arnold Kluge for ‘total 
evidence’ - using all available data to build 
phylogenetic hypotheses (Kluge 1989, Eernisse et al. 
1993). Indeed, there are many interesting and novel 
approaches to the analysis of morphological data, 
including analysis of continuous traits rather than the 
typical binary traits, some of which likely increase the 
amount of signal available for building the ToL (Wiens 
2001, Parins-Fukuchi 2018). The challenges of 
morphological versus molecular data in building the 
ToL extend to microbial taxa (Harper et al. 2009). 

My personal opinion about the molecules versus 
morphology debate in systematics is that the biggest 
challenge to morphological traits is not their numbers 
or their delimitation but their non-random choice: the 
act of consciously choosing specific morphological 
characters is often non-random, and can be biased 
for or against a given phylogenetic hypothesis in 
ways that make it difficult to study the distribution 
and strength of phylogenetic signal in a meaningful 
way. Some have argued that morphological characters 
show as much (or as little) homoplasy (conflicting 
phylogenetic signal) as do molecular characters, 
thereby suggesting that any biases in their choice 
may not compromise the purity of their signal 
(Sanderson et al. 1989). However, the fact that 
morphological characters are chosen consciously 
suggests that, unknowingly, their average signal could 
be inflated and their noise reduced compared to 
randomly chosen characters, a topic that deserves 
more study. As the scoring of morphological and 
other phenotypic traits becomes more automated, and 
the ontological relationships among traits become 
better defined, the possibility of scoring ‘unbiased’ 
morphological traits for phylogenetic analysis 
becomes more likely (Deans et al. 2015, Dececchi 
et al. 2015, Thessen et al. 2015, Edmunds et al. 
2016, Wirkner et al. 2017). The era of “phenomics” 
has only just begun for animal diversity (O’Leary et 
al. 2013, Maddison 2016), is more mature for plant 
diversity (Tardieu et al. 2017, Ninomiya et al. 2019) 
and could result in a renaissance of the use of 
morphology in systematics. Some approaches to the 
unbiased selection of morphological characters, such 
as those envisioned in analyses of the fictitious but 
scoreable ‘Caminalcules’ of James Rohlf (Rohlf et 
al. 1967), should be revisited with modern methods 
of phenomics. 

For now, genomic characters - with significant 
help from fossil taxa and guidance (if not explicit 
characters) from morphology - seem to be moving 
towards a realization of the ToL at a rapid pace. For 
decades, DNA sequence data has been the workhorse 
of molecular systematics, a paradigm that was 
significantly accelerated in the 1980s and 1990s by 
the advent of the polymerase chain reaction (PCR: 
Kocher et al. 1989, Paabo et al. 1989). Today, whole 
genomes are the norm, dramatically advancing the 
field of comparative genomics. Even as whole 
genome data becomes commonplace, these genomes 
continue to be analyzed with reference primarily to 
the DNA sequence of the interrogated genomes. 

3. Models for phylogenetic analysis of whole genome data 

Phylogenetic analyses of genome-wide DNA 
sequences come in two principal flavors: either the 
various genes or other phylogenomic markers are 
concatenated into single “supergenes” for analysis 
(the supermatrix or concatenation method: de Queiroz 
et al. 2007) or, more recently, trees of each gene are 
constructed and subsequently combined according to 
biological models such as the multispecies coalescent 
model to achieve an overarching ‘species tree’ 
(Edwards 2009, Rannala et al. 2020). The term 
species tree is meant to refer to any demographic 
history that consists solely of divergence events, with 
little or no subsequent hybridization or other 
reticulation between lineages. The entities analyzed 
need not be species in the classical sense, but could 
instead simply be populations or representatives of 
higher taxa. The term ‘species tree’ was popularized 
by John Avise to distinguish the phylogenetic 
relationships of species from the phylogenetic patterns 
found in the underlying genes, which, early on, were 
found frequently to vary from gene to gene across 
the genome (Neigel et al. 1986, Avise 2000). The 
heterogeneity in gene trees sampled from a genome 
need not arise out of complex histories of hybridization 
or gene flow, but can instead be a simple 
consequence of rapid speciation accompanied by large 
population sizes, which permit only slow shifts in 
allele frequencies due to drift, with the result that the 
phylogeny of alleles sampled from a tree with species 
A, B and C may not reflect the branching sequence 
of populations A, B and C. Such heterogeneity is 
called incomplete lineage sorting (ILS) when thinking 
forwards in time, from past to present, or deep 
coalescence (DC) when thinking backwards in time, 
from present to past (Maddison 1997, Maddison et 
al. 2006). ILS is now known to be a ubiquitous 
aspect of genomic history and is likely the most 
common source of mismatches between gene trees 
and species trees (Pollard et al. 2006, Edwards 2009, 
Pease et al. 2013). 
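
The three-species case described above can be made concrete with a short calculation. For a species tree ((A,B),C) with one allele sampled per species, standard coalescent theory gives the probability that the gene tree matches the species tree as 1 - (2/3)exp(-t), where t is the length of the internal branch in coalescent units. The Python sketch below is illustrative only: the function names, the choice of units (generations divided by 2N for a diploid population of size N) and the simulation settings are assumptions made here, not part of any published software.

```python
import math
import random


def concordance_probability(t):
    """Probability that a three-taxon gene tree matches the species tree ((A,B),C).

    t is the internal branch length in coalescent units (assumed here to be
    generations divided by 2N for a diploid population of size N).
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def simulate_concordance(t, n_loci=100_000, seed=1):
    """Monte Carlo check using one lineage per species at n_loci independent loci."""
    rng = random.Random(seed)
    concordant = 0
    for _ in range(n_loci):
        # Waiting time for the A and B lineages to coalesce in their ancestor.
        if rng.expovariate(1.0) < t:
            concordant += 1
        # Otherwise all three lineages reach the root population, where the
        # first pair to coalesce is drawn uniformly from AB, AC and BC.
        elif rng.randrange(3) == 0:
            concordant += 1
    return concordant / n_loci
```

Under these assumptions, an internal branch of t = 0.1 leaves roughly 60% of loci discordant with the species tree, whereas t = 3 leaves only about 3%, which is why short internal branches and large populations generate abundant ILS.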

Phylogenetic methods employing the 
multispecies coalescent model (MSC: Degnan et al. 
2009) arose out of early observations of abundant 
gene tree heterogeneity. The MSC is a mathematical 
framework describing the distribution of gene trees 
expected given a species tree with a set of 
phylogenetic relationships and branch lengths. The 
likelihood of a gene tree given a species tree, first 
written down by Rannala and Yang (2003), forms the 
cornerstone of most phylogenetic methods employing 
the MSC. The MSC allows researchers to take a set 
of rooted or unrooted gene trees and estimate an 
overarching species tree. Such methods provide a 
biological model of exceptional breadth, utility and 
elegance to combine data from different genes 
together to estimate a phylogenetic tree, even when 
those gene trees differ from one another. MSC 
methods of tree inference are, to my mind, more 
satisfying than earlier ‘supertree’ methods of 
combining data from different genes or character 
sets, primarily because MSC models are informed 
by biological principles and population genetics, as 
opposed to consensus or supertree methods, which 
often combine trees according to arbitrary models (Steel et 
al. 2000, Bininda-Emonds 2005). As of this writing, 
there are numerous MSC methods for building 
phylogenetic trees, ranging from fast clustering 
methods to likelihood methods and slower but more accurate 
Bayesian methods (Liu et al. 2009, Liu et al. 2010, 
Chifman et al. 2014, Rannala et al. 2017, Zhang et 
al. 2018, Liu et al. 2019). Although supermatrix 
methods are still used, primarily as a benchmark to 
acknowledge the traditional approach to molecular 
systematics, MSC methods are growing rapidly and 
are generally considered superior and more appropriate 
for genome-scale data. In fact, theory suggests that 
supermatrix methods are actually a subset of MSC 
methods, and suffer from the same types of model 
violations that have been identified for MSC methods 
(Liu et al. 2015a, Liu et al. 2015b, Edwards 2016, 
Edwards et al. 2016). A recent study showed that, 
across 47 phylogenomic data sets of widely varying 
taxa, size and complexity, MSC methods did a better 
job at explaining characteristics of the underlying 
sequence data than did supermatrix methods, 
suggesting that the MSC model is more appropriate 
for building the ToL (Jiang et al. 2019). Although 
MSC methods were inspired by observations of gene 
tree heterogeneity, they do not require such 
heterogeneity to be applied appropriately. The key 
assumption is that recombination between loci is free, 
allowing gene trees with varying topologies or branch 
lengths to be observed at different loci. 
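
To illustrate how a set of gene trees can be turned into a species tree estimate, consider again three taxa with one allele each. Under the MSC the most frequent rooted gene tree topology is expected to match the species tree, and the frequency p of that topology can be inverted to give an internal branch length, t = -ln(1.5(1 - p)), in coalescent units. The following Python sketch is a toy estimator written for this illustration; the function name and input format are hypothetical, and it is not the algorithm of any of the MSC programs cited above.

```python
import math
from collections import Counter


def estimate_triplet(gene_tree_sister_pairs):
    """Estimate a rooted three-taxon species tree from gene tree topologies.

    gene_tree_sister_pairs: iterable of 2-item collections, each naming the
    two taxa that are sisters in one gene tree (hypothetical input format).
    Returns (sister_pair, t_hat), where t_hat is the internal branch length
    in coalescent units.
    """
    counts = Counter(frozenset(pair) for pair in gene_tree_sister_pairs)
    if not counts:
        raise ValueError("at least one gene tree is required")
    # The most frequent topology is taken as the species tree; ties are
    # broken by order of first appearance in the input.
    sister_pair, n_major = counts.most_common(1)[0]
    p = n_major / sum(counts.values())
    if p >= 1.0:
        # No discordance observed, so the branch length is unbounded above.
        return sister_pair, math.inf
    # Invert p = 1 - (2/3) * exp(-t); clamp at zero to guard against rounding.
    t_hat = max(0.0, -math.log(1.5 * (1.0 - p)))
    return sister_pair, t_hat
```

For example, if 70 of 100 hypothetical gene trees group A with B and the remaining 30 are split evenly between the two alternatives, the sketch returns the pair {A, B} and an internal branch of about 0.8 coalescent units. Note that no gene tree heterogeneity is required for the estimator to work; it simply returns an unbounded branch length when every locus agrees.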

Despite the wide applicability and versatility of 
MSC models, clearly more work is needed to develop 
models that capture even more subtle aspects of 
biological realities underlying the ToL (Bravo et al. 
2019). The strict MSC model has a number of 
shortcomings - most notably, the inability to 
accommodate gene flow between species, an 
evolutionary force whose prominence has only 
increased in recent years (Mallet et al. 2015). New 
MSC models that can accommodate gene flow 
between lineages have been emerging and promise 
increased flexibility to account for reticulations in the 
ToL (Stenz et al. 2015, Yu et al. 2015, Solis-Lemus 
et al. 2016, Solis-Lemus et al. 2017). Another 
shortcoming is the assumption of no recombination 
within loci - a challenge for data sets, such as 
transcriptome data sets, whose loci sometimes span 
several tens of kilobases, if not greater genomic 
lengths (Gatesy et al. 2013). Marker choice will 
influence how strongly this assumption is violated 
(Costa et al. 2016, Jennings 2017). For example, 
the use of single nucleotide polymorphisms (SNPs) 
in phylogenetic analysis can circumvent this particular 
recombination assumption, because recombination 
cannot take place within a single site in the genome. 
However, the options available for analyzing SNP 
data in a phylogenetic context are still limited, although 
some authors suggest success at analyzing such data 
across diverse taxonomic scales (Eaton et al. 2017, 
Leache et al. 2017, Stange et al. 2018, Spriggs et 
al. 2019). Another challenge is the difficulty of 
combining molecular and morphological data in the 
framework provided by the MSC. It is not yet entirely 
clear how to model morphological characters using 
the MSC approach, although important inroads to 
this question are emerging (Mendes et al. 2018). 
Although the MSC has clearly brought the field of 
phylogenetics to a new ‘local optimum’ (Edwards 
2009), we are still far from achieving a global optimum 
that will accommodate the many complexities inherent 
in the ToL (Bravo et al. 2019). 

Another major area in need of further progress 
is the use of rare genomic changes in building the 
ToL (Rokas et al. 2000, Boore 2006, Rokas 2006, 
Boore et al. 2008, Rogozin et al. 2009). Rare genomic 
changes are structural or non-sequence-based 
molecular characters that usually have low homoplasy 
and high phylogenetic information content. Molecular 
characters like gene order, synteny, transposable 
element insertions, insertions and deletions in protein-coding 
genes and chromosomal rearrangements fall 
into this category. The most common type of rare 
genomic change used in phylogenetics is 
transposable element insertions (Shedlock et al. 
2000), which have been used to great success in 
several clades across the ToL (Grover et al. 2008, 
Rogozin et al. 2009, Churakov et al. 2010, Wu et al. 
2019). The emphasis by the community on DNA 
sequence data in phylogenetics - something I often 
refer to as the ‘dance of nucleotides’ - is limiting, 
because rare genomic changes often offer much more 
compelling evidence for the existence of a given 
clade. Rare genomic changes are not necessarily 
immune to the challenges posed by ILS, introgression 
or other reticulations (Hillis 1999, Suh et al. 2015), 
but they often have higher evidentiary power than does 
the accumulated signal of nucleotide variations in 
DNA sequences. Without reviewing the state of the 
field regarding rare genomic changes, I will simply 
note that in the past, these types of molecular 
characters were very hard won and therefore very 
rarely employed because of the large amount of 
benchwork required to discover and validate them 
(Ellegren 2007, Kaiser et al. 2007). Now that genome 
sequencing has become more routine, rare-genomic 
changes are much easier to discover and analyze, 
and are proving their worth in diverse phylogenetic 
contexts (Schmitz et al. 2016, Cloutier et al. 2019). 
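
As a simple illustration of how rare genomic changes are scored, transposable element insertions can be coded as presence (1) or absence (0) at orthologous loci, and each insertion shared by exactly two of three ingroup taxa counted as support for grouping those two taxa. The Python sketch below assumes such a presence/absence coding; the function name and the example data are hypothetical and do not reproduce any particular published test.

```python
from collections import Counter
from itertools import combinations


def count_clade_support(markers, ingroup):
    """Count insertions shared by exactly two ingroup taxa.

    markers: list of dicts mapping taxon name -> 1 (insertion present) or
             0 (absent) at one orthologous locus (hypothetical coding).
    ingroup: the three taxa whose relationships are in question.
    Returns a Counter keyed by frozenset pairs of taxa.
    """
    support = Counter({frozenset(pair): 0 for pair in combinations(ingroup, 2)})
    for marker in markers:
        carriers = frozenset(t for t in ingroup if marker.get(t) == 1)
        # Insertions in one taxon or in all three say nothing about which
        # two taxa are sisters, so only two-taxon insertions are counted.
        if len(carriers) == 2:
            support[carriers] += 1
    return support
```

A result in which one pair is supported by many insertions and the other two pairs by none is the kind of low-homoplasy signal discussed above, whereas appreciable counts for conflicting pairs would point to ILS, introgression or other reticulation.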


4. Conclusions 

The future of phylogenomics and the quest for 
the ToL seems clear on at least one front: there will 
be a concerted effort to sequence the genomes of all 
life, including the genomes of complex eukaryotic 
organisms (Lewin et al. 2018). Such an effort will 
not only provide a robust genomic foundation for 
building the ToL but will also present unprecedented 
opportunities for discovery in biology and for the 
conservation of biological diversity. Building the ToL 
will not only require large data sets but also intelligent 
analyses of such data sets. More data is not a 
panacea for phylogenetics (Delsuc et al. 2005, 
Philippe et al. 2011); rather, judicious use of available 
markers and appropriate models of evolution will be 
required. Increased dialogue between empiricists, 
theoreticians, computer scientists and engineers, and 
natural historians with deep knowledge of biodiversity 
will be essential to success. It’s hard to imagine a 
time when the ToL is complete, but the journey 
towards its completion provides excitement enough. 